JMIR Medical Informatics
JMIR Publications Inc.
Preprints posted in the last 30 days, ranked by how well they match the content profile of JMIR Medical Informatics, based on 17 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.
Kale, S.; Singh, D.; Truumees, E.; Geck, M.; Stokes, J.
High inpatient opioid exposure is associated with increased risk of persistent opioid use. Early identification of high-risk patients may improve opioid stewardship. We developed machine learning models to predict high opioid exposure during hospitalization using electronic health record data from MIMIC-IV. We conducted a retrospective study of 223,452 unique first hospital admissions in MIMIC-IV. The outcome was high opioid exposure, defined as the top decile among opioid-exposed admissions (MME/day ≥225), representing 2.65% of all admissions. Structured early-admission features included demographics, admission characteristics, laboratory utilization and abnormality summaries, and 24-hour procedural indicators. Discharge-note data were incorporated using ClinicalBERT embeddings and interpretable bigram features. Models were trained using an 80/10/10 split and evaluated with temporal validation on the most recent 10% of admissions. Performance was assessed using ROC-AUC and PR-AUC with 95% confidence intervals. Among structured-only models, XGBoost achieved the best test performance (ROC-AUC 0.932 [0.924-0.940]; PR-AUC 0.223 [0.193-0.262]). The combined structured and notes model improved precision-recall performance (ROC-AUC 0.932 [0.920-0.943]; PR-AUC 0.276 [0.229-0.331]). Temporal evaluation showed similar discrimination (ROC-AUC 0.929; PR-AUC 0.223). High-risk bigrams included procedural terms such as "external fixation" and "cervical discectomy." Integration of structured and text-derived features improved risk stratification compared to structured data alone. Interpretable bigram signals reflected procedural complexity and orthopedic pathology, reinforcing the clinical plausibility of model predictions. Multimodal EHR-based models accurately predict high inpatient opioid exposure and may support targeted opioid stewardship during hospitalization.
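As an illustration of the bootstrap confidence intervals this abstract reports alongside PR-AUC, here is a minimal sketch using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost on synthetic data with a comparably rare (~2.7%) outcome; all data, features, and parameters are hypothetical, not the authors' pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic cohort with ~2.7% positives, mirroring the 2.65% outcome prevalence
X, y = make_classification(n_samples=4000, weights=[0.973], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

# Bootstrap the test set to get a 95% CI for PR-AUC
boots = []
idx = np.arange(len(y_te))
for _ in range(200):
    b = rng.choice(idx, size=len(idx), replace=True)
    if y_te[b].sum() == 0:  # skip resamples with no positives
        continue
    boots.append(average_precision_score(y_te[b], p[b]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"ROC-AUC {roc_auc_score(y_te, p):.3f}; "
      f"PR-AUC {average_precision_score(y_te, p):.3f} [{lo:.3f}-{hi:.3f}]")
```

With prevalence this low, PR-AUC intervals are much wider than ROC-AUC intervals, which is the same pattern visible in the reported results.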
Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.
Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score. Non-parametric methods were used throughout. Results: Clinicians reported high perceived time saving (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: Medical AI assistant, LLMs in healthcare, Agentic AI, Clinical decision support, Point of care AI
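The Net Promoter Score reported here follows a standard formula (percentage of promoters minus percentage of detractors). A minimal sketch, with a hypothetical 29-respondent rating distribution rather than the study's actual data:

```python
def net_promoter_score(ratings):
    """NPS on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    n = len(ratings)
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / n

# Hypothetical 29-clinician panel with no detractors (illustrative only)
ratings = [10] * 18 + [9] * 6 + [8] * 3 + [7] * 2
print(f"NPS = {net_promoter_score(ratings):.1f}")
```

With zero detractors, the score reduces to the promoter percentage alone, which is why an NPS above 80 on a small panel implies almost every respondent rated 9 or 10.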
Kim, Y. W.; Lau, W.; Patel, N.; Kendrick, K.; Wu, A.; Feldman, T.; Ahern, R.; Oka, A.
Background: The Kansas City Cardiomyopathy Questionnaire (KCCQ) is a validated patient-reported outcome measure for heart failure. However, its clinical utility is limited by incomplete and inconsistent data collection. We aimed to develop and validate machine learning models to estimate KCCQ overall summary scores from electronic health record (EHR) data. Methods: We assembled a retrospective cohort of 10,889 heart failure patients with recorded KCCQ scores from the Truveta database. Predictor features were derived from structured EHR variables across 13 historical time windows (15-360 days). Multiple regression algorithms were evaluated, followed by SHapley Additive exPlanations (SHAP)-based feature reduction and nested cross-validation for hyperparameter optimization. Model performance was assessed using the coefficient of determination (R2), mean absolute error (MAE), and ordinal discrimination and calibration for categorical severity classification. Results: Histogram-based gradient boosting (HGB) with HGB-SHAP feature selection achieved the strongest performance, reducing feature dimensionality by more than 94% while maintaining estimation accuracy. The 240-day window performed best (R2=0.522, MAE=12.485). For categorical severity classification, the model demonstrated strong ordinal discrimination (mean ordinal AUROC=0.850). Quantile-based calibration improved classification balance, increasing the F1-score for the most severe category (KCCQ<25) from 0.180 to 0.428 and the quadratic weighted kappa from 0.601 to 0.640. Longer EHR observation windows were associated with improved prediction performance. Conclusion: Machine learning models can estimate KCCQ scores from routine EHR data with clinically meaningful accuracy and strong discriminatory performance.
This approach may help extend assessment of patient-reported health status to populations in which survey-based data are incompletely captured, supporting population-level cardiovascular outcomes assessment and risk stratification in heart failure care.
Miran, S. A.; Cheng, Y.; Faselis, C.; Brandt, C.; Vasaitis, S.; Nesbitt, L.; Zanin, L.; Tekle, S.; Ahmed, A.; Nelson, S. J.; Zeng-Treitler, Q.
Objectives: To develop and evaluate predictive models for unused outpatient appointments (missed or cancelled) using a large national electronic health record (EHR) repository in the United States. Design: Retrospective observational study using machine learning and statistical modeling. Setting: A U.S. national electronic health record repository (Cerner Real World Database) covering healthcare encounters from 2010 to 2025. Participants: Adult patients aged ≥18 years with routine outpatient encounters recorded in the database. One outpatient appointment with a known status was randomly selected per patient, resulting in a final analytic sample of 5,699,861 encounters. Primary and Secondary Outcome Measures: The primary outcome was whether the index outpatient appointment was attended or unused (missed or cancelled). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Methods: Predictors included patient characteristics (demographics and insurance type), appointment characteristics (day, time, season, and urbanicity), prior cancellation rate, and time gap between the index appointment and the previous visit. We compared the predictive performance of two machine learning models (random forest classifier and extreme gradient boosting (XGBoost)) with logistic regression. An explainable AI analysis of feature impact was performed on the final XGBoost model. Results: Among 5,699,861 outpatient encounters, 3,650,715 (64.0%) were attended and 2,049,146 (36.0%) were unused. XGBoost achieved the best predictive performance on the test dataset (AUC = 0.95), followed by random forest (AUC = 0.92) and logistic regression (AUC = 0.89). Feature impact score analysis revealed highly non-linear associations between predictors and the risk of unused appointments at the individual level. Conclusions: Unused outpatient appointments can be accurately predicted using routinely available EHR data.
Integrating predictive models into scheduling workflows may improve healthcare efficiency and optimize appointment management.
Article Summary - Strengths and limitations of this study:
- This study used one of the largest national electronic health record datasets to develop predictive models for unused outpatient appointments.
- Multiple modeling approaches, including logistic regression and machine learning methods (random forest and XGBoost), were compared to evaluate predictive performance.
- An explainable artificial intelligence method was applied to quantify feature impact and improve model interpretability.
- The retrospective design and reliance on routinely collected EHR data may introduce data quality limitations and unmeasured confounding.
- The database did not distinguish clearly between cancelled appointments and no-shows.
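The three-way model comparison described in this abstract (logistic regression vs. random forest vs. gradient boosting, scored by AUC on a held-out set) can be sketched as below. This is a synthetic illustration: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the data mimic only the ~36% unused-appointment rate, not the study's predictors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the attended (0) vs unused (1) outcome, ~36% positive class
X, y = make_classification(n_samples=5000, n_features=12, n_informative=6,
                           weights=[0.64], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
results = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
           for name, m in models.items()}
for name, auc in results.items():
    print(f"{name}: AUC = {auc:.3f}")
```

Evaluating all candidates on the same held-out split, as above, is what makes the reported AUC ordering (0.95 vs 0.92 vs 0.89) a fair comparison.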
Van, T. A.
Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components (ML, XAI, and KG) in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge.
This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
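The core bibliometric operation behind the reported 53.3:1 disparity is counting documents whose keyword metadata mention terms from each vocabulary. A minimal sketch with a hypothetical five-document corpus (the term sets and documents are illustrative, not the study's actual vocabularies):

```python
from collections import Counter

# Toy corpus: each record is the keyword list of one document (hypothetical)
docs = [
    ["shap", "xgboost", "t2dm"],
    ["lime", "random forest"],
    ["shap", "deep learning"],
    ["knowledge graph", "gnn", "t2dm"],
    ["xgboost", "shap"],
]
XAI_TERMS = {"shap", "lime"}
KG_TERMS = {"knowledge graph", "gnn"}

# Count documents mentioning at least one term from each vocabulary
xai_docs = sum(1 for d in docs if XAI_TERMS & set(d))
kg_docs = sum(1 for d in docs if KG_TERMS & set(d))
freq = Counter(k for d in docs for k in d)

print(f"XAI:KG document ratio = {xai_docs}:{kg_docs}")
print(f"'shap' appears in keyword lists {freq['shap']} times")
```

Scaled to the study's 2,048-document corpus, the same document-level counting yields the 906 vs 17 figures quoted above.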
Challier, V.; Jacquemin, C.; Diebo, B.; Dehouche, N.; Denisov, A.; Cristini, J.; Campana, M.; Castelain, J.-E.; Lonjon, G.; Lafage, V.; Ghailane, S.; SpineDAO Collaborative Group,
Background: Synthetic data have emerged as a complementary strategy for secondary use of clinical registries, enabling data sharing without patient-level exposure. In spine surgery, multicenter data sharing is constrained by institutional governance and patient privacy regulations. Validated synthetic data generation may enable broader access to surgical outcomes data for artificial intelligence development without compromising patient confidentiality. Objective: To describe and benchmark a three-domain validated synthetic data pipeline applied to a multicenter, tokenized spine surgery registry (SpineBase), and to establish a reproducible certification framework for synthetic spine surgery datasets. Methods: We extracted 125 sacroiliac joint fusion cases from the SpineBase registry (SIBONE study, IRB-SOFCOT approval Ref. 14-2025; CNIL MR-004 Ref. 2234503 v 0). A GaussianCopula generative model was trained on 52 structured variables spanning demographics, preoperative assessments, operative details, and longitudinal outcomes at 3, 6, 12, and 24 months. Synthetic datasets of 100, 1,000, and 10,000 patients were generated. Validation followed a three-domain framework: (1) fidelity, assessed by Kolmogorov-Smirnov tests and Jensen-Shannon divergence; (2) utility, assessed by train-on-synthetic, test-on-real (TSTR) methodology; and (3) privacy, assessed by nearest-neighbor distance ratio (NNDR), membership inference attack, and k-anonymity proxy. Results: All three validation gates passed. Fidelity: mean KS p-value 0.52 (threshold >0.05). Privacy: NNDR >1.0 in 98.9% of synthetic records; membership inference AUROC 0.57. Utility: 12-month Oswestry Disability Index prediction yielded Pearson r = 0.29, consistent with expected attenuation at N = 125. A SHA-256 cryptographic hash of each certified dataset was anchored on the Solana blockchain for immutable provenance.
Conclusions: A validated, blockchain-anchored synthetic data pipeline for spine surgery registries is technically feasible and meets current publication-standard criteria for fidelity and privacy. Utility metrics scale with registry size, creating a direct incentive for multicenter data contribution. This framework provides a reproducible methodology for synthetic data certification in spine surgery research, and establishes certified synthetic datasets as a privacy-native substrate for expert-annotation pipelines, as demonstrated in the companion Spine Reviews study.
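The fidelity gate described above (per-variable Kolmogorov-Smirnov tests plus Jensen-Shannon divergence between real and synthetic marginals) can be sketched with SciPy. Here both samples are drawn from the same distribution as a stand-in for a well-fitted generator; the data, sizes, and bin count are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(50, 10, size=(125, 3))    # stand-in for 125 real cases, 3 variables
synth = rng.normal(50, 10, size=(1000, 3))  # stand-in synthetic sample

# Gate 1: per-variable two-sample KS test (a variable "passes" if p > 0.05)
ks_p = [ks_2samp(real[:, j], synth[:, j]).pvalue for j in range(real.shape[1])]

# Gate 2: Jensen-Shannon divergence on binned marginal distributions
def js_div(a, b, bins=20):
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(pa, pb) ** 2  # jensenshannon returns the distance (sqrt)

js = [js_div(real[:, j], synth[:, j]) for j in range(real.shape[1])]
print(f"mean KS p = {np.mean(ks_p):.2f}; mean JS divergence = {np.mean(js):.3f}")
```

In a real certification run, each of the 52 registry variables would pass through both gates, and the mean KS p-value (0.52 in the study) summarizes overall marginal fidelity.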
Yee, N. J.; Chen, T.; Huang, Y. Q.; Whyne, C.; Halai, M.
Objectives: For suspected hip fractures, prehospital protocols directing patients to an orthopaedic centre rather than the nearest emergency department (ED) could reduce time-to-surgery but may impact EMS travel burden. This study evaluates the impact of transfer protocols by quantifying transport to hospitals from long term care (LTC) facilities across Ontario. Methods: A retrospective cross-sectional analysis of all Ontario LTC facilities and hospitals was performed. Two protocols were modeled: standard transfer to the nearest ED with subsequent transfer if required, and selective transfer based on Collingwood Hip Fracture Rule prehospital screening directly to the nearest orthopaedic services (orthoED). Median one-way travel distances were calculated from Google Maps. Results: In Ontario, 15.4% of LTC residents require hospital destination decisions because their nearest ED lacks orthopaedic services; for these facilities, median distances were 2.7km to the ED and 36.0km to the orthoED. Among the 52 LTC facilities where selective transfer was distance-optimal, it substantially reduced travel for patients with hip fracture (31.1km vs 49.6km; P<.01) while only modestly increasing travel for patients without hip fracture. Where standard transfer was distance-optimal, little travel difference was noted for patients with hip fracture; however, patients with false-positive screens traveled significantly further to an orthoED. The greatest negative consequences of selective transfer lie in the 1.3% of residents living farthest (>100km) from an orthoED. Conclusions: EMS direct transportation to hospitals with orthopaedics may improve hip fracture care but can increase EMS burden due to patients identified falsely as having a hip fracture, particularly in remote communities.
Pandey, A. K.
Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty. Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models: a classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80), trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation: two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance. Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0).
The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, ε²=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman ρ=0.440, p=0.024; Kendall τ=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods. Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.
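Two of the building blocks named in this abstract are standard and easy to reproduce: the Wilson score interval used for the audit sensitivity (36/52 detected deaths) and the binary Shannon entropy used for uncertainty triage. A minimal stdlib sketch (the triage-zone labels are this paper's terminology; the cutoffs themselves are not specified in the abstract):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def shannon_entropy(p):
    """Entropy (bits) of a predicted mortality probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

lo, hi = wilson_ci(36, 52)  # 36 of 52 deaths detected in the audit
print(f"sensitivity 69.2% (95% CI {lo:.1%}-{hi:.1%})")
print(f"entropy at p=0.5: {shannon_entropy(0.5):.2f} bits")  # maximal uncertainty
```

Running the Wilson interval on 36/52 reproduces the 55.7%-80.1% CI quoted in the abstract; entropy peaks at p=0.5, which is what makes it a natural score for routing ambiguous predictions to a GRAY ZONE.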
Yamga, E.; Goudrar, R.; Despres, P.
Introduction: Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping, the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods: We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results: The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines.
Conclusion: MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.
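The annotation-reliability kappas reported for MIPA are Cohen's kappa over two annotators' labels. A minimal sketch with hypothetical binary phenotype labels for ten discharge summaries (not MIPA's actual annotations):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical phenotype labels from two annotators over 10 discharge summaries
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # disagrees on one document

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Here raw agreement is 9/10 = 0.90 but chance agreement is 0.50 given the label prevalences, so kappa = (0.90 - 0.50) / (1 - 0.50) = 0.80, which is why kappa is the preferred reliability statistic over raw percent agreement.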
Li, H.; Yu, Y.; Bhandarkar, A.; Kumar, R.; Clark, I. H.; Hu, Y.; Cao, W.; Zhao, N.; LI, F.; Tao, C.
Objective: Behavioral and social factors (BSFs) substantially influence the risk, onset, and progression of Alzheimer disease and related dementias (ADRD). A systematic representation of their interplay is essential for advancing prevention and targeted interventions. However, BSF-related knowledge is scattered across heterogeneous sources, limiting scalable evidence synthesis and computational analysis. To address this, we created a Behavioral Social Data and Knowledge Ontology for ADRD (BSOAD) to represent and integrate BSFs with respect to ADRD. Material and Methods: BSOAD was developed following established ontology design principles, prioritizing reuse of existing ontology elements to ensure semantic interoperability. It was built upon the Social Determinants of Health Ontology (SDoHO) and the Drug-Repurposing Oriented Alzheimer Disease Ontology (DROADO). BSF-related classes were enriched with ICD-10-CM Z55-Z65 codes and ADRD-related classes with AD Onto. Relationships between BSFs and ADRD were derived through literature mining. Ontology quality was evaluated through Hootation-based expert review and an LLM-assisted framework assessing structural coverage and semantic coherence. Results: BSOAD contains 2,275 classes, 153 object properties, and 49 data properties. Expert review demonstrated strong rater agreement (0.95), with disagreements resolved through discussion. LLM-based evaluation showed high category coverage rates (≥0.97) and robust semantic alignment with the relevant literature (average completeness = 0.79; conciseness = 0.94). Discussion and Conclusion: BSOAD is, to our knowledge, the first ontology to systematically represent BSFs and hierarchically model their interrelationships in ADRD. It establishes a semantic backbone for computational analysis and knowledge integration. The LLM-assisted evaluation framework demonstrates the feasibility of scalable, automated ontology assessment.
Uzochukwu, B. S. C.; Cherima, Y. J.; Enebeli, U. U.; Okeke, C. C.; Uzochukwu, A. C.; Omoha, A.; Hassan, B.; Eronu, E. M.; Yusuf, S. M.; Uzochukwu, K. A.; Kalu, E. I.
Background: The integration of artificial intelligence (AI) into clinical practice holds transformative potential for healthcare in West Africa, but safe deployment requires context-appropriate governance, accountability, and post-deployment monitoring frameworks. This cross-sectional mixed-methods study examined preferences and concerns of West African clinicians and technical experts regarding AI governance structures, post-deployment surveillance mechanisms, and accountability allocation. Methods: A structured questionnaire was administered to 136 physicians affiliated with the West African College of Physicians (February 22-28, 2026), complemented by 72 key informant interviews with technical leads, AI developers, data scientists, policymakers, and healthcare leaders. Data were analyzed using descriptive statistics, inferential tests, and thematic analysis. Results: Clinicians strongly preferred independent regulatory bodies (40.4%) for overseeing AI tool performance, with high trust ratings (mean 4.3/5), while vendor self-monitoring received minimal support (3.7%, mean 2.4/5). Real-time dashboards were the most favored monitoring approach (41.9%). Clear accountability pathways (94.1%), algorithm transparency (91.9%), and real-time performance data (89.7%) were rated essential by majorities. Major concerns included clinicians being unfairly blamed for AI errors (76.5%), excessive vendor control (72.8%), and absence of clear reporting pathways (69.9%). Qualitative findings emphasized continuous performance tracking for accuracy, fairness, and bias; structured incident reporting; protocols for model drift and failure; and multi-layered governance combining independent oversight, institutional AI committees, and explicit liability frameworks. Conclusion: This study provides the first empirical evidence from West Africa on clinician preferences for AI governance.
Findings offer actionable guidance for policymakers to build trustworthy, equitable, and safe AI integration frameworks that prioritize transparency, independent oversight, and clinician protection. Keywords: artificial intelligence; AI governance; post-deployment monitoring; accountability; West Africa; clinician preferences; health data science.
Ferguson, D. J.
Background: Clinical pharmacists, trainees, and educators rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions. Existing workflows require navigation across PubMed, DailyMed, LactMed, interaction checkers, and specialty guideline repositories with manual de-duplication, appraisal, and synthesis. Commercial platforms that integrate these functions are costly and often unavailable in community, rural, and international training contexts. Objective: This report describes the architecture of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports preliminary stress-test results against a complex multi-morbidity case corpus. AuditMed is intended for research and educational use and is not a substitute for clinical judgment or validated commercial clinical decision-support systems. Methods: AuditMed integrates nineteen free, publicly available clinical and biomedical application programming interfaces into a six-stage Search → Select → Parse → Analyze → Infer → Create pipeline and supports browser-local patient-case ingestion with regex-based HIPAA Safe Harbor de-identification. Preliminary stress-testing was conducted against eleven cases (Cases 30 through 40) from the Complex Clinical Case Compendium Software Validation Suite, each featuring over twenty concurrent active disease states. For each case, the one-click inference pipeline was executed with default settings and the full Clinical Inference Report was captured verbatim. No retrieval-sensitivity, synthesis-fidelity, or time-to-answer endpoints were pre-specified; the exercise was qualitative and oriented toward pipeline behavior under extreme multi-morbidity. Results: The pipeline completed without fatal errors for all eleven cases and produced a structured Clinical Inference Report in each instance. Quantitative-finding detection performed as designed for hematologic parameters and cardiac biomarkers.
Two parser defects were identified and are reproduced in the appendix: an age-as-fever regex-precedence defect affecting seven cases and a diagnosis-versus-medication parsing defect affecting one case. Evidence-linkage rate varied from zero evidence-linked statements in seven cases to eleven in one case, reflecting dependence of the inference layer on MeSH-indexed literature coverage of the specific case diagnoses. Conclusions: AuditMed is an early-stage, open-source platform whose value at this stage is in providing a free, transparent, auditable workflow for multi-source evidence synthesis with explicit uncertainty flagging. The preliminary results document both robust end-to-end completion under extreme case complexity and specific, reproducible parser defects that will be addressed before formal evaluation. Planned evaluation studies are described.
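Regex-based Safe Harbor de-identification of the kind AuditMed describes, and the regex-precedence bugs it reports, can both be illustrated with a minimal sketch. These patterns are hypothetical examples of a few identifier classes, not AuditMed's actual rules; note that pattern order matters, which is exactly the failure mode behind the reported age-as-fever defect:

```python
import re

# Illustrative Safe Harbor patterns (hypothetical, applied in order)
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
]

def deidentify(text):
    """Replace each matched identifier class with a placeholder token."""
    for pat, token in PATTERNS:
        text = pat.sub(token, text)
    return text

note = "Seen 03/14/2024. Contact: jdoe@example.org, 555-123-4567, SSN 123-45-6789."
scrubbed = deidentify(note)
print(scrubbed)
```

The SSN pattern must run before the phone pattern here; swapping or loosening patterns is how one regex can shadow another, the general class of defect the stress test surfaced.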
Lentz, T.; Burrows, J.; Brucker, A.; Wong, A. I.; Qualls, L.; Divakaran, R.; Centeno, C.; Suther, T.; Thomas, L.
Background: Lumbar fusion and decompression procedures are widely used for degenerative spine conditions but are associated with substantial health care costs and variable outcomes. Orthobiologic treatments, including platelet-rich plasma (PRP) and bone marrow aspirate concentrate (BMAC), have emerged as less invasive options for select patients who meet surgical criteria. However, concerns remain that orthobiologic care may delay rather than avert surgery, potentially increasing downstream utilization and costs. Comparative evidence on real-world utilization and costs is limited. Methods: We conducted a retrospective, observational study using linked commercial insurance claims and a national orthobiologic treatment registry. Adults with lumbar degenerative disc disease (DDD) who met criteria for lumbar fusion or laminectomy, foraminotomy, discectomy, and facetectomy (LFDF) procedures, and who received PRP injection (with or without BMAC) or surgery between 2016 and 2023 were included. Two comparisons were evaluated: PRP versus lumbar fusion and PRP versus lumbar decompression procedures. Propensity score matching was used to balance cohorts on demographic characteristics, comorbidities, spine-related diagnoses, prior health care use, and severity proxies. Outcomes included spine-related health care resource use and aggregate costs at 12 and 24 months, with exploratory analyses at 36 and 48 months. Costs were estimated using multiple approaches, including Medicare-based estimates and commercial payer methods. Results: After matching, 133 patients receiving PRP were compared with 2,560 patients undergoing fusion, and 198 patients receiving PRP were compared with 3,960 patients undergoing LFDF. Rates of subsequent spine surgery following PRP were low and below cell suppression thresholds through 24 months, with similar findings in exploratory longer-term analyses.
Compared with surgical cohorts, patients receiving PRP had lower rates of postoperative imaging, home health services, and outpatient visits, with no consistent differences in opioid use, magnetic resonance imaging, or physical therapy. At 12 and 24 months, mean aggregate costs were significantly higher for fusion and LFDF cohorts across most costing methods. Cost differences were largest for fusion comparisons and were driven primarily by index procedure costs and higher reoperation and imaging rates in surgical cohorts. Findings were generally consistent across sensitivity and exploratory analyses. Conclusions: Among select patients with degenerative spine conditions who meet surgical criteria, PRP was associated with lower health care utilization and substantially lower costs compared with lumbar fusion or LFDF, without evidence of increased progression to surgery. These findings support consideration of orthobiologic options for appropriately selected patients when surgery is not the only viable treatment option. Limitations include selection bias, absence of patient-reported outcomes, and claims-based severity measures.
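The propensity score matching used to balance the PRP and surgical cohorts follows a standard recipe: model treatment assignment from covariates, match on the estimated score, and check covariate balance. A minimal sketch on synthetic, deliberately confounded data (greedy 1:1 nearest-neighbour matching; the study's actual matching specification is not stated in the abstract):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy cohort: 4 covariates; treatment (1 = PRP, 0 = surgery) depends on X[:, 0],
# so the groups are confounded before matching
n = 2000
X = rng.normal(size=(n, 4))
t = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 0.5))))

# Step 1: estimate propensity scores
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Step 2: greedy 1:1 nearest-neighbour matching on the propensity score
treated = np.where(t == 1)[0]
pool = list(np.where(t == 0)[0])
pairs = []
for i in treated:
    j = min(pool, key=lambda c: abs(ps[c] - ps[i]))
    pairs.append((i, j))
    pool.remove(j)  # match without replacement
m_t = np.array([i for i, _ in pairs])
m_c = np.array([j for _, j in pairs])

# Step 3: check balance via the standardized mean difference on the confounder
def smd(a, b):
    return (a.mean() - b.mean()) / np.sqrt((a.var() + b.var()) / 2)

smd_before = smd(X[t == 1, 0], X[t == 0, 0])
smd_after = smd(X[m_t, 0], X[m_c, 0])
print(f"SMD before matching: {smd_before:.3f}; after: {smd_after:.3f}")
```

The drop in standardized mean difference after matching is the usual evidence that the matched cohorts are comparable on measured covariates; residual selection bias from unmeasured factors, as the abstract's limitations note, is not addressed by this step.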
Challier, V.; Diebo, B.; Lafage, V.; Dehouche, N.; Lonjon, G.; Cristini, J.; SpineDAO
Study Design: Prospective observational study using a novel digital ledger technology (DLT)-based crowdsourcing platform. Objective: To develop and evaluate Spine Reviews, a blockchain-based platform for aggregating spine treatment recommendations from an international specialist panel, and to validate the clinical coherence of the resulting dataset. Summary of Background Data: Predictive models for low back pain treatment are limited by small, homogeneous datasets that fail to capture inter-clinician variability. Traditional multi-center data collection is expensive, slow, and geographically constrained. DLT-based crowdsourcing with cryptographic credentialing may overcome these barriers. Methods: Five hundred synthetic patient vignettes (digital twins) were generated; 463 were retained after quality control. A review platform was built on the Solana blockchain using non-transferable Soulbound Tokens (SBTs) for credentialing and smart-contract compensation. Fifty-two specialists from 7 countries provided 4+ reviews per vignette across four treatment tiers, without access to imaging or physical examination. Mixed-effects regression with reviewer random intercepts partitioned decision variability. Results: The platform collected 2,066 completed reviews (97.7%) over 37 days at USD 0.97/review. Variance decomposition revealed that 36.7% of treatment tier variability was attributable to patient presentation, 19.2% to reviewer practice style, and 44.1% to their interaction. Neurological deficits (β=0.39), symptom duration (β=0.12), and pain (β=0.09) independently predicted treatment escalation (all p<0.001). Gwet's AC1 was almost perfect for emergency (0.92) and substantial for conservative decisions (0.67). Reviewer confidence in treatment recommendations decreased with escalating tier severity (conservative 4.59/5 vs surgical 4.05/5), suggesting appropriate uncertainty calibration.
Conclusions: DLT with SBT credentialing enables rapid, global, cost-effective aggregation of clinically coherent expert judgment. The three-component variance structure quantifies clinical equipoise in spine care and establishes that predictive models require diverse, multi-reviewer training data. Keywords: digital ledger technology; blockchain; crowdsourcing; clinical decision-making; low back pain; Soulbound Tokens
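The three-component variance structure can be illustrated with a toy decomposition. This is a sketch with simulated scores and invented effect sizes, not the study's analysis: the paper fits a mixed-effects regression with reviewer random intercepts, whereas the code below uses a plain two-way ANOVA decomposition on a fully crossed patient-by-reviewer grid to separate patient, reviewer, and interaction shares of treatment-tier variance.

```python
import numpy as np

# Simulated treatment-tier scores: patient effect (case presentation),
# reviewer effect (practice style), and their interaction. All standard
# deviations are invented for illustration.
rng = np.random.default_rng(0)
n_patients, n_reviewers = 100, 20
patient_eff = rng.normal(0.0, 1.0, n_patients)
reviewer_eff = rng.normal(0.0, 0.7, n_reviewers)
interaction = rng.normal(0.0, 1.1, (n_patients, n_reviewers))
tiers = patient_eff[:, None] + reviewer_eff[None, :] + interaction

# Two-way ANOVA sums of squares: patient rows, reviewer columns,
# and the leftover interaction term.
grand = tiers.mean()
ss_total = ((tiers - grand) ** 2).sum()
ss_patient = n_reviewers * ((tiers.mean(axis=1) - grand) ** 2).sum()
ss_reviewer = n_patients * ((tiers.mean(axis=0) - grand) ** 2).sum()
ss_interaction = ss_total - ss_patient - ss_reviewer

share_patient = ss_patient / ss_total
share_reviewer = ss_reviewer / ss_total
share_interaction = ss_interaction / ss_total
```

With these (arbitrary) effect sizes, the patient share exceeds the reviewer share and a large remainder sits in the interaction, mirroring the shape, though not the exact figures, of the reported 36.7%/19.2%/44.1% split.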
Vasquez-Venegas, C.; Chewcharat, A.; Kimera, R.; Kurtzman, N.; Leite, M.; Woite, N. L.; Muppidi, I. J.; Muppidi, R. J.; Liu, X.; Ong, E. P.; Pal, R.; Myers, C.; Salzman, S.; Patscheider, J. S.; John, T. R.; Rogers, M.; Samuel, M.; Santana-Guerrero, J. L.; Yaacob, S.; Gameiro, R. R.; Celi, L. A.
Show abstract
Computer vision models for chest X-ray interpretation hold significant promise for global healthcare, but their clinical value depends on equitable development across diverse populations. We conducted a scientometric analysis to examine authorship patterns, geographic distribution, and dataset origins to assess potential disparities that could affect clinical applicability. We systematically reviewed literature on computer vision applications for chest X-rays published between 2017 and 2025 in multiple databases, including PubMed, Embase, and SciELO. Using the Dimensions API and manual extraction, we analyzed 928 eligible studies, examining first and senior author affiliations, institutional contributions, dataset provenance, and collaboration patterns across different income classifications based on World Bank categories. High-income countries dominated research leadership, representing 55.6% of first authors and 59.7% of senior authors; no first authors were affiliated with low-income countries. China (16.93%) and the United States (16.72%) led in first authorship positions. Most datasets (73.6%) originated from high-income settings, with the United States being the largest contributor (40.45%). Private datasets were most frequently used (20.52%). Cross-income collaborations were rare, with only 3.9% of publications involving partnerships between high-income and lower-middle-income countries. Findings reveal substantial disparities in who shapes computer vision research on chest X-rays and which populations are represented in training data. These imbalances risk developing AI systems that perform inconsistently across diverse healthcare settings, potentially exacerbating healthcare inequities. Addressing these disparities requires coordinated efforts to develop globally representative datasets, establish equitable international collaborations, and implement policies that promote inclusive research practices.
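The kind of tally behind the authorship-share and collaboration figures can be sketched as follows. The records below are invented toy data, not the study's dataset; the field names are assumptions.

```python
from collections import Counter

# Hypothetical per-paper records: World Bank income group of the first
# and senior author's affiliation.
papers = [
    {"first_author_income": "high", "senior_author_income": "high"},
    {"first_author_income": "high", "senior_author_income": "upper-middle"},
    {"first_author_income": "upper-middle", "senior_author_income": "upper-middle"},
    {"first_author_income": "lower-middle", "senior_author_income": "high"},
]

# Percentage share of first authors per income group.
counts = Counter(p["first_author_income"] for p in papers)
shares = {group: 100 * n / len(papers) for group, n in counts.items()}

# Cross-income collaborations: first and senior authors from different groups.
cross = sum(p["first_author_income"] != p["senior_author_income"] for p in papers)
cross_pct = 100 * cross / len(papers)
```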
Jafarifiroozabadi, R.
Show abstract
Background: Safety is a critical concern in behavioral health crisis units (BHCUs), where environmental risks (e.g., ligature points) can lead to injury to self or others. However, limited research has examined how perceived safety influences facility selection among patients and care partners, or how these perceptions align with AI-driven safety risk assessments in such environments. Method: To address these gaps, a nationwide discrete choice online survey was conducted using image-based scenarios of BHCU environments, where participants selected preferred facilities based on a range of attributes, including environmental safety risks (e.g., ligature points). Additionally, participants identified safety risks in survey images, which were compared with outputs from an AI-driven tool developed and trained by experts to detect environmental risks. Quantitative analysis using conditional logit models examined the influence of attributes on facility choice, while spatial comparisons of annotated images and heatmaps assessed participant and AI-identified risk alignments. Results: Findings revealed that a higher frequency of safety risks in images significantly reduced the likelihood of facility selection (p < .001, OR ≈ 1.28), highlighting the importance of perceived safety in user decision-making. While there was notable alignment between heatmaps generated by participants and AI, key differences emerged, suggesting that participant safety perception was influenced by features not fully captured by AI, such as the type of materials or unknown, out-of-label safety risks in facility images. Conclusions: Despite these limitations, results highlighted the value of integrating AI-driven assistive tools for non-expert user safety risk assessment to support decision-making for safer BHCU environments.
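The discrete-choice logic can be sketched in a few lines. In a conditional logit, each facility's utility is a linear function of its attributes and choice probabilities follow a softmax over the choice set; the coefficient and risk counts below are illustrative assumptions, not the study's fitted estimates.

```python
import numpy as np

# Assumed penalty per visible safety risk in a facility image.
beta_risk = -0.25
# Number of safety risks visible in each of three candidate facilities.
risk_counts = np.array([0, 2, 5])

# Conditional-logit choice probabilities: softmax over facility utilities.
utility = beta_risk * risk_counts
probs = np.exp(utility) / np.exp(utility).sum()
```

With a negative risk coefficient, the facility showing the fewest safety risks receives the highest choice probability, matching the direction of the reported effect.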
Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.
Show abstract
Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (>=2 consecutive calls with Total_Alerts >=3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, +/-10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, +/-14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5. 
In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two-thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence, in duration and in the consistency of high-severity multi-domain flagging across calls, distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal.
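The window-identification rule stated in the abstract (>=2 consecutive calls with Total_Alerts >=3) is simple enough to sketch directly. The function name, return format, and thresholds-as-parameters are illustrative; only the rule itself comes from the abstract.

```python
def find_instability_windows(total_alerts, threshold=3, min_run=2):
    """Return (start, end) index pairs for runs of >= min_run consecutive
    calls whose Total_Alerts value meets the threshold."""
    windows, start = [], None
    for i, alerts in enumerate(total_alerts):
        if alerts >= threshold:
            if start is None:
                start = i  # a qualifying run begins here
        else:
            # Run ended: keep it only if it was long enough.
            if start is not None and i - start >= min_run:
                windows.append((start, i - 1))
            start = None
    # Handle a run that extends to the final call.
    if start is not None and len(total_alerts) - start >= min_run:
        windows.append((start, len(total_alerts) - 1))
    return windows
```

For example, the alert sequence [1, 4, 5, 2, 3, 3, 3, 0] contains two windows: calls 1-2 and calls 4-6; a single isolated high-alert call never qualifies.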
Ivezic, V.; Dawson, J.; Doherty, R.; Mohapatra, S.; Issa, M.; Chen, S.; Fonarow, G. C.; Ong, M. K.; Speier, W.; Arnold, C.
Show abstract
Objectives: Heart failure is a leading cause of mortality, necessitating identification of patients at increased risk needing intervention. In this study, we investigated if Fitbit data can reveal physiological trends associated with hospital visit risk. Materials and methods: Individuals with heart failure (n=249) were randomized into three arms for prospective 180-day monitoring. All arms received a Fitbit and wireless weight scale. Arm 1 received devices only; Arm 2 received a mobile app with surveys; Arm 3 received the app plus financial incentives. Results: 51 participants had hospital visits during the study period. These individuals took fewer steps (p=.002) and reported increased symptom severity (p=.044). Resting heart rate increased three days prior to a visit (p=.022). Baseline steps revealed a higher visit probability for less active participants (p=.003). Discussion and conclusion: Passive physiological monitoring can effectively identify individuals at risk of health exacerbation, demonstrating the potential of wearable devices for timely clinical intervention.
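A within-person trend flag of the kind this monitoring looks for, a resting heart rate rising above its recent baseline, can be sketched as follows. This is not the study's model; the baseline length and z-score threshold are invented parameters.

```python
from statistics import mean, stdev

def heart_rate_alert(resting_hr, baseline_days=14, z_threshold=2.0):
    """Flag the most recent day if resting heart rate exceeds the
    prior baseline window by more than z_threshold standard deviations."""
    if len(resting_hr) <= baseline_days:
        return False  # not enough history to form a baseline
    baseline = resting_hr[-(baseline_days + 1):-1]
    mu, sigma = mean(baseline), stdev(baseline)
    return sigma > 0 and (resting_hr[-1] - mu) / sigma > z_threshold
```

A sudden jump to 75 bpm after two stable weeks around 60 bpm would trigger the flag, while a one-beat fluctuation would not.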
Wang, X.; Faviez, C.; Vincent, M.; Andrew, J. J.; Le Priol, E.; Saunier, S.; Knebelmann, B.; Zhang, R.; Garcelon, N.; Burgun, A.; Chen, X.
Show abstract
Objectives: Rare diseases often require longitudinal monitoring to characterise progression, yet much clinical information remains locked in unstructured electronic health records (EHRs). Efficient recovery of such data is critical for accurate prognostic modelling and clinical trial preparation. We aimed to develop and evaluate a small language model (SLM)-based pipeline for extracting longitudinal information from French clinical notes of patients with rare kidney diseases. Methods: As a use case, we focused on serum creatinine, a key biomarker of kidney function. We analyzed 81 clinical notes comprising 200 measurements (each a triplet of date, value, and unit). Four open-source SLMs (Mistral-7B, Llama-3.2-3B, Qwen3-4B, Qwen3-8B) were systematically tested with different prompting strategies in French and English. Outputs were post-processed to standardize formats and resolve inconsistencies, and performance was assessed across model size, prompting, language, and robustness to text duplication. Results: All SLMs extracted structured triplets, with F1-scores ranging from 0.519 to 0.928 (Qwen3-8B), outperforming the rule-based baseline. Larger models generally performed better, while prompting strategy and language had modest effects across models. SLMs also showed variable robustness to duplicated content common in real-world EHR notes. Discussion: Lightweight, locally deployable language models can accurately extract longitudinal biomarkers from unstructured clinical notes. Our findings highlight their practicality for rare diseases where data scarcity often limits task-specific model training. Conclusion: SLMs provide a privacy-preserving and resource-efficient solution for recovering longitudinal biomarker trajectories from unstructured notes, offering potential to advance real-world research and patient care in rare kidney diseases.
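Extraction quality for structured triplets can be scored with exact-match F1 against gold annotations, one plausible reading of the F1-scores reported above. The scoring function and the example triplets below are illustrative, not the paper's actual evaluation code or data.

```python
def triplet_f1(predicted, gold):
    """Exact-match F1 over (date, value, unit) triplets."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # triplets matching the gold set exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: one correct triplet, one with a mismatched unit.
gold = [("2021-03-02", "95", "umol/L"), ("2022-01-10", "110", "umol/L")]
pred = [("2021-03-02", "95", "umol/L"), ("2022-01-10", "1.10", "mg/dL")]
```

Exact matching is deliberately strict: a value extracted in the wrong unit counts as both a false positive and a false negative, which is why unit standardization in post-processing matters.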
Luisto, R.; Snell, K.; Vartiainen, V.; Sanmark, E.; Äyrämö, S.
Show abstract
In this study, we investigate gender bias in a Retrieval-Augmented Generation (RAG) based AI assistant developed for Finnish wellbeing services counties. We tested the system using 36 clinically relevant queries, each rendered in three gendered variants (male, female, gender-neutral), and evaluated responses using both an LLM-as-a-judge approach and a human expert panel consisting of a physician and a sociologist specializing in ethics. We observed substantial and clinically significant differences across gendered variants, including differential treatment urgency, inappropriate symptom associations, and misidentification of clinical context. Female variants disproportionately framed responses around childcare and reproductive health regardless of clinical relevance, reflecting societal stereotypes rather than medical reasoning. Bias manifested both at the LLM generation stage and the RAG retrieval stage, in several cases causing the model to hallucinate responses entirely. Some bias patterns were persistent across repeated runs, while others appeared inconsistently, highlighting the challenge of distinguishing systematic bias from stochastic variation.
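The variant-generation step of such an audit can be sketched with simple templating: each clinical query is rendered in male, female, and gender-neutral form before being sent to the assistant. The template and subject phrasings below are invented for illustration, not the study's actual queries.

```python
# One hypothetical clinical query template; the study used 36 such queries.
TEMPLATE = "A {subject} has had chest pain for two days. What should they do?"

# Three gendered renderings of the same clinical presentation.
VARIANTS = {
    "male": "45-year-old man",
    "female": "45-year-old woman",
    "neutral": "45-year-old person",
}

queries = {gender: TEMPLATE.format(subject=subject)
           for gender, subject in VARIANTS.items()}
```

Holding everything except the gendered phrase constant is what lets any difference in the assistant's responses be attributed to gender rather than to clinical content.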